
[SPARK-22672][SQL][TEST] Refactor ORC Tests #19882

Closed
wants to merge 5 commits into from

Conversation

dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Dec 4, 2017

What changes were proposed in this pull request?

Since SPARK-20682, we have two OrcFileFormats. This PR refactors the ORC tests with three principles (with a few exceptions):

  1. Move test suite into sql/core.
  2. Create HiveXXX test suite in sql/hive by reusing sql/core test suite.
  3. OrcTest will provide common helper functions and val orcImp: String.
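The three principles can be sketched roughly as follows (a hedged illustration, not the final code; the trait and mixin names follow the suites listed below, and the exact helper set is an assumption):

```scala
// sql/core: OrcTest holds the shared helpers and the implementation switch.
abstract class OrcTest extends QueryTest with SQLTestUtils {
  // "native" -> new OrcFileFormat in sql/core; "hive" -> Hive built-in one.
  val orcImp: String = "native"
  // common helpers such as withOrcFile / withOrcDataFrame live here
}

// sql/hive: a HiveXXX suite reuses the sql/core suite and only flips the switch.
class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton {
  override val orcImp: String = "hive"
}
```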

Test Suites

Native OrcFileFormat

  • org.apache.spark.sql.execution.datasources.orc
    • OrcFilterSuite
    • OrcPartitionDiscoverySuite
    • OrcQuerySuite
    • OrcSourceSuite
  • o.a.s.sql.hive.orc
    • OrcHadoopFsRelationSuite

Hive built-in OrcFileFormat

  • o.a.s.sql.hive.orc
    • HiveOrcFilterSuite
    • HiveOrcPartitionDiscoverySuite
    • HiveOrcQuerySuite
    • HiveOrcSourceSuite
    • HiveOrcHadoopFsRelationSuite

Hierarchy

OrcTest
    -> OrcSuite
        -> OrcSourceSuite
    -> OrcQueryTest
        -> OrcQuerySuite
    -> OrcPartitionDiscoveryTest
        -> OrcPartitionDiscoverySuite
    -> OrcFilterSuite

HadoopFsRelationTest
    -> OrcHadoopFsRelationSuite
        -> HiveOrcHadoopFsRelationSuite

Please note the following.

  • Unlike the other test suites, OrcHadoopFsRelationSuite doesn't inherit OrcTest. It lives in sql/hive like ParquetHadoopFsRelationSuite because of its dependencies, and it follows the existing convention of using val dataSourceName: String.
  • The OrcFilterSuites cannot reuse test cases because the filter APIs have different function signatures: Hive 1.2.1 ORC classes versus Apache ORC 1.4.1 classes.

How was this patch tested?

Pass the Jenkins tests with reorganized test suites.

@dongjoon-hyun
Member Author

Hi, @cloud-fan , @gatorsmile , @HyukjinKwon , @viirya .
This is a test case restructure after #19651 .

@SparkQA

SparkQA commented Dec 5, 2017

Test build #84443 has finished for PR 19882 at commit 5f2025a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with SharedSQLContext
  • implicit class IntToBinary(int: Int)
  • case class OrcParData(intField: Int, stringField: String)
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • abstract class OrcPartitionDiscoveryTest extends OrcTest
  • class OrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with SharedSQLContext
  • case class AllDataTypesWithNonPrimitiveType(
  • case class BinaryData(binaryData: Array[Byte])
  • case class Contact(name: String, phone: String)
  • case class Person(name: String, age: Int, contacts: Seq[Contact])
  • abstract class OrcQueryTest extends OrcTest
  • test("Creating case class RDD table")
  • test("save and load case class RDD withNones as orc")
  • class OrcQuerySuite extends OrcQueryTest with SharedSQLContext
  • case class OrcData(intField: Int, stringField: String)
  • abstract class OrcSuite extends OrcTest with BeforeAndAfterAll
  • class OrcSourceSuite extends OrcSuite with SharedSQLContext
  • abstract class OrcTest extends QueryTest with SQLTestUtils
  • class HiveOrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class HiveOrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton
  • class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite

@HyukjinKwon
Member

Whoa big class list. Will take a look soon within tomorrow as well.

@dongjoon-hyun
Member Author

Thank you so much, @HyukjinKwon !

emptyDF.createOrReplaceTempView("empty")

// This creates 1 empty ORC file with Hive ORC SerDe. We are using this trick because
// Spark SQL's ORC data source always avoids writing empty ORC files.
Member

Is this still using Hive ORC SerDe?

Member Author

Thank you for review, @viirya . I'll update tomorrow.

@cloud-fan
Contributor

OrcTest will provide common helper functions and def format: String.

Instead of having def format: String, can we just add beforeAll and afterAll in the test suites to set the ORC_IMPLEMENTATION?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 5, 2017

That would take more lines, wouldn't it? Either way, we need helper functions.
And when we remove the old Hive OrcFileFormat and the conf later, this approach will reduce the change.

@cloud-fan
Contributor

OK, maybe have a def orcImp: String, which can be native or hive. Then we can put the beforeAll and afterAll in OrcTest.

That avoids changing the test code from spark.read.orc to spark.read.format(format).
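The suggestion above could look roughly like this (a hedged sketch; `SQLConf.ORC_IMPLEMENTATION` is the conf entry behind `spark.sql.orc.impl`, and the field names are assumptions, not the merged code):

```scala
abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAll {
  // "native" or "hive"; overridden by the sql/hive suites.
  def orcImp: String = "native"

  private var originalConfORCImplementation: String = _

  protected override def beforeAll(): Unit = {
    super.beforeAll()
    // Remember the original value, then switch the implementation for this suite.
    originalConfORCImplementation = spark.conf.get(SQLConf.ORC_IMPLEMENTATION.key)
    spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, orcImp)
  }

  protected override def afterAll(): Unit = {
    // Restore the original implementation so later suites are unaffected.
    spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, originalConfORCImplementation)
    super.afterAll()
  }
}
```

With this shape, existing calls like `spark.read.orc(path)` pick up the right implementation without being rewritten to `spark.read.format(format)`.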

@dongjoon-hyun
Member Author

Okay. No problem. Thanks, @cloud-fan .

/**
* A test suite that tests Apache ORC filter API based filter pushdown optimization.
*/
class OrcFilterSuite extends OrcTest with SharedSQLContext {
Member

Let HiveOrcFilterSuite extend OrcFilterSuite?

Member Author

Ur, it's impossible because of the reason I mentioned in PR description.

OrcFilterSuite and HiveOrcFilterSuite cannot reuse test cases due to the different function signatures using Hive 1.2.1 ORC classes and Apache ORC 1.4.1 classes.

Member

Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84524 has finished for PR 19882 at commit baec5fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with SharedSQLContext
  • implicit class IntToBinary(int: Int)
  • case class OrcParData(intField: Int, stringField: String)
  • case class OrcParDataWithKey(intField: Int, pi: Int, stringField: String, ps: String)
  • abstract class OrcPartitionDiscoveryTest extends OrcTest
  • case class AllDataTypesWithNonPrimitiveType(
  • case class BinaryData(binaryData: Array[Byte])
  • case class Contact(name: String, phone: String)
  • case class Person(name: String, age: Int, contacts: Seq[Contact])
  • abstract class OrcQueryTest extends OrcTest
  • test("Creating case class RDD table")
  • test("save and load case class RDD withNones as orc")
  • class OrcQuerySuite extends OrcQueryTest with SharedSQLContext
  • case class OrcData(intField: Int, stringField: String)
  • abstract class OrcSuite extends OrcTest with BeforeAndAfterAll
  • class OrcSourceSuite extends OrcSuite with SharedSQLContext
  • abstract class OrcTest extends QueryTest with SQLTestUtils with BeforeAndAfterAll
  • class HiveOrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class HiveOrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class HiveOrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class HiveOrcSourceSuite extends OrcSuite with TestHiveSingleton
  • class HiveOrcHadoopFsRelationSuite extends OrcHadoopFsRelationSuite

read
.option(ConfVars.DEFAULTPARTITIONNAME.varname, defaultPartitionName)
spark.read
.option("hive.exec.default.partition.name", defaultPartitionName)
Contributor

the new ORC didn't change these config names?

Member Author

Yes. In fact, Apache ORC doesn't have these params.

import testImplicits._

def orcImp: String = "native"

var originalConfORCImplementation = "native"
Contributor

private var?

Member Author

Yep.


override def orcImp: String = "hive"

test("SPARK-8501: Avoids discovery schema from empty ORC files") {
Contributor

why is this test not in the native ORC test suite?

Member Author

Native ORC solves this bug and has a corresponding test case here.

+  test("Schema discovery on empty ORC files") {
+    // SPARK-8501 is fixed.

Contributor

then why don't we put this test in the base class?

Member Author

This only works in the new OrcFileFormat.

  • The new test case is in OrcQuerySuite for the new OrcFileFormat.
  • The old test case is in HiveOrcQuerySuite for the old OrcFileFormat.
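The native-side counterpart might look roughly like this (a hypothetical sketch; `withTempPath` is a helper these suites already use, and the exact write path and assertion are illustrative, not the merged test):

```scala
// OrcQuerySuite (native OrcFileFormat): SPARK-8501 is fixed, so schema
// discovery works even when the written ORC data contains no rows.
test("Schema discovery on empty ORC files") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    // Write a zero-row DataFrame with the native ORC data source.
    spark.range(0).selectExpr("id AS a").write.orc(path)
    // Reading the path back should still recover the written schema.
    assert(spark.read.orc(path).schema.fieldNames.contains("a"))
  }
}
```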


spark.sql(
s"""CREATE TEMPORARY VIEW normal_orc_source
|USING org.apache.spark.sql.hive.orc
Contributor

can this just be USING orc?

Member Author

Sure. It just comes from the original test case.
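With the shorter form, the view definition would read like this (a sketch; the `path` option is illustrative and assumes a `path` variable in scope):

```scala
spark.sql(
  s"""
     |CREATE TEMPORARY VIEW normal_orc_source
     |USING orc
     |OPTIONS (path '$path')
   """.stripMargin)
```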

}
}

test("SPARK-21791 ORC should support column names with dot") {
Member

Would we need this test case for Hive's one too?

Member Author

The old OrcFileFormat fails on this test case.
Do you mean adding an exception-catching test case?

Member

Oh, I overlooked. Sure, that's fine.

Member Author

Thanks!

override def beforeAll(): Unit = {
class OrcSourceSuite extends OrcSuite with SharedSQLContext {

protected override def beforeAll(): Unit = {
Member

Mind if I ask where this is needed?

Member Author

The test cases of OrcSuite assume these tables exist.

import testImplicits._

def orcImp: String = "native"
Member

Maybe val could be used.

Member Author

Yep. Done.

@dongjoon-hyun
Member Author

It's rebased to the master to resolve conflicts. Also, I addressed the comments. Thanks!

Member

@HyukjinKwon HyukjinKwon left a comment

Loosely related though, should we maybe rename org.apache.spark.sql.hive.orc.Orc* -> org.apache.spark.sql.hive.orc.HiveOrc* in the main codes too to distinguish the newer ORC from the old Hive ORC?

/**
* A test suite that tests Apache ORC filter API based filter pushdown optimization.
*/
class OrcFilterSuite extends OrcTest with SharedSQLContext {
Member

Could we leave some comments to explain that reason?
Seems there are many duplications and I would wonder why.


test("filter pushdown - combinations with logical operators") {
withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
// Because `ExpressionTree` is not accessible at Hive 1.2.x, this should be checked
Member

I wrote the original tests here like this, using toString, partly because ExpressionTree (SearchArgument.getExpression) is inaccessible and the string format is easy to read.

Although that tree seems to be available in ORC now, I think it's okay to keep the tests like this since they're easy to read, but let's fix up the comments here. They don't look related to Hive anyway.

Member Author

Yep, I'll remove the comment. For the test case, I agree with you. Also, these string-based tests will stay consistent with the Hive ORC tests for a while.
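The string-based checking under discussion looks roughly like this (a hedged sketch; `checkFilterPredicate` and `withOrcDataFrame` are the suite's own helpers, and the expected string is illustrative rather than the exact ORC output):

```scala
// Instead of walking the (previously inaccessible) ExpressionTree, build the
// SearchArgument for a predicate and compare its toString with the expected
// pushed-down form.
withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
  checkFilterPredicate(
    '_1 < 2 || '_1 > 3,
    """leaf-0 = (LESS_THAN _1 2)
      |leaf-1 = (LESS_THAN_EQUALS _1 3)
      |expr = (or leaf-0 (not leaf-1))""".stripMargin.trim)
}
```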

Member

Yup.

@HyukjinKwon
Member

LGTM BTW.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84581 has finished for PR 19882 at commit fcb2ccb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Thank you so much, @HyukjinKwon !

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 7, 2017

@HyukjinKwon . The PR code and description are updated as follows.

  • Update comments
  • Rename back to the originals

The main reason I used the Hive prefix was naming consistency, but now it's restored.

As you see in the PR description, we need to use HiveOrcHadoopFsRelationSuite
because OrcHadoopFsRelationSuite already exists in the same package.

@HyukjinKwon
Member

HyukjinKwon commented Dec 7, 2017

I actually suggested something similar before:

org.apache.spark.sql.execution.datasources.csv.InferSchema
org.apache.spark.sql.execution.datasources.json.InferSchema

but I remember receiving advice at that time, and it became as below:

org.apache.spark.sql.execution.datasources.csv.CSVInferSchema
org.apache.spark.sql.execution.datasources.json.JsonInferSchema

After rethinking it, I realised this is better.

Likewise, I actually liked the Hive prefix. It was easier to distinguish.

@HyukjinKwon
Member

I meant in #19882 (review), I liked this Hive prefix here so wondered if we could do the same thing for the main codes too. Since this PR only refactors the tests, it's loosely related though.

@dongjoon-hyun
Member Author

Oh. I'll bring it back.

@HyukjinKwon
Member

I am sorry, I should have clarified this up front.

@dongjoon-hyun
Member Author

Definitely, my bad.

For the main code, we can do that later in a separate PR if needed. I'd like this PR to contain tests only.

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84589 has finished for PR 19882 at commit 1828571.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class OrcFilterSuite extends OrcTest with TestHiveSingleton
  • implicit class IntToBinary(int: Int)
  • class OrcPartitionDiscoverySuite extends OrcPartitionDiscoveryTest with TestHiveSingleton
  • class OrcQuerySuite extends OrcQueryTest with TestHiveSingleton
  • class OrcSourceSuite extends OrcSuite with TestHiveSingleton

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84591 has finished for PR 19882 at commit f56423d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Dec 7, 2017

Test build #84599 has finished for PR 19882 at commit f56423d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in c1e5688 Dec 7, 2017
@dongjoon-hyun
Member Author

Thank you so much, @cloud-fan , @HyukjinKwon , and @gatorsmile !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-22672 branch December 7, 2017 15:20
asfgit pushed a commit that referenced this pull request Dec 9, 2017
## What changes were proposed in this pull request?

During #19882, `conf` was mistakenly used to switch the ORC implementation between `native` and `hive`. To affect `OrcTest` correctly, `spark.conf` should be used.
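The one-line difference can be sketched as follows (hedged; here `conf` is the `SQLConf` reachable from the test base class, while `spark.conf` is the runtime config of the session the suites actually run against — the surrounding field names are assumptions):

```scala
// Before (mistaken): setting SQLConf directly may target a different
// session state than the one the suites use.
conf.setConf(SQLConf.ORC_IMPLEMENTATION, orcImp)

// After: going through the test SparkSession affects OrcTest as intended.
spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, orcImp)
```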

## How was this patch tested?

Pass the tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19931 from dongjoon-hyun/SPARK-22672-2.